Mid-gestation fetal cortex dataset: Batch correction and Dimensionality reduction¶

Upstream Steps

  • QC filter on cells
  • expression filter on genes
  • Normalization and log10 transformation by Scanpy
  • HVG by Triku

This notebook

  • Integration by Harmony
  • Dimensionality reduction by
    • UMAP
    • Diffusion Map
    • Force-Directed Graph

1. Environment Set Up¶

1.1 Library upload¶

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import igraph as ig
import matplotlib.pyplot as plt
from datetime import datetime
from scipy.sparse import csr_matrix, isspmatrix

import scanpy as sc
import scanpy.external as sce

1.2 Settings¶

In [2]:
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80)
In [3]:
#results_file = '/home/..../brainomics/Dati/3_AdataDimRed.h5ad'
results_file = '/group/brainomics/Intermediate/3_AdataDimRed.h5ad'

1.3 Start Computation time¶

In [4]:
print(datetime.now())
2022-11-25 19:42:24.002373

2. Read input files¶

In [5]:
adata = sc.read('/group/brainomics/Intermediate/2_AdataNorm.h5ad')
In [6]:
print('Loaded Normalizes AnnData object: number of cells', adata.n_obs)
print('Loaded Normalizes AnnData object: number of genes', adata.n_vars)

print('Available metadata for each cell: ', adata.obs.columns)
Loaded Normalizes AnnData object: number of cells 27457
Loaded Normalizes AnnData object: number of genes 13945
Available metadata for each cell:  Index(['Cluster', 'Subcluster', 'Donor', 'Layer', 'Gestation_week', 'Index',
       'Library', 'Number_genes_detected', 'Number_UMI',
       'Percentage_mitochondrial', 'S_phase_score', 'G2M_phase_score', 'Phase',
       'n_genes_by_counts', 'total_counts', 'total_counts_mito',
       'pct_counts_mito', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes',
       'n_counts'],
      dtype='object')

3. Check for batch effects¶

In [7]:
adata_check = adata.copy()
In [8]:
sc.pp.pca(adata_check, n_comps=50, use_highly_variable=True, svd_solver='arpack')
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:01)
In [9]:
sc.pp.neighbors(adata_check, n_neighbors=30, n_pcs=25)
computing neighbors
    using 'X_pca' with n_pcs = 25
2022-11-25 19:42:35.169676: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-25 19:42:35.295165: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-11-25 19:42:36.325841: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/R/4.2.1/lib/R/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2022-11-25 19:42:36.326018: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/R/4.2.1/lib/R/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
2022-11-25 19:42:36.326050: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
    finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:31)
In [10]:
sc.tl.umap(adata_check)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:21)
In [11]:
sc.pl.umap(adata_check, color=['Donor', 'Layer'], size=10)
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
In [12]:
plt.rcParams['figure.dpi'] = 120
sc.pl.umap(adata_check, color=['Donor'], size=10)
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
In [13]:
del adata_check

4. Integration with Harmony¶

NOTE: CORRECTING FOR 'Donor'

From the UMAP, it is visible a batch effect related to the donor identity, particularly in the region of migrating neurons. We therefore apply a batch correction using Harmony algorithm and

4.1 Calculate PCA¶

In [14]:
sc.pp.pca(adata, n_comps=50, use_highly_variable=True, svd_solver='arpack')
computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:01)
In [15]:
sc.settings.set_figure_params(dpi=80)
sc.pl.pca_variance_ratio(adata)
In [16]:
sc.pl.pca_variance_ratio(adata, log=True)

4.2 Performing correction with Harmony¶

Harmony uses an iterative clustering approach to align cells from different batches. For each iteration:

  • soft k-means clustering to group cells from multiple datasets
  • computation for each cluster of a global centroid and a dataset-specific centroid
  • from the centroids, calculation of a correction factor
  • the correction factor is used to correct each cell with a cell-specific factor

References

  • Harmony Quick Start
  • Harmony implementation in python
  • Harmony GitHub

image.png

In [17]:
sc.external.pp.harmony_integrate(adata, 'Donor')
2022-11-25 19:43:34,232 - harmonypy - INFO - Iteration 1 of 10
INFO:harmonypy:Iteration 1 of 10
2022-11-25 19:43:41,392 - harmonypy - INFO - Iteration 2 of 10
INFO:harmonypy:Iteration 2 of 10
2022-11-25 19:43:48,542 - harmonypy - INFO - Iteration 3 of 10
INFO:harmonypy:Iteration 3 of 10
2022-11-25 19:43:55,935 - harmonypy - INFO - Iteration 4 of 10
INFO:harmonypy:Iteration 4 of 10
2022-11-25 19:44:03,160 - harmonypy - INFO - Iteration 5 of 10
INFO:harmonypy:Iteration 5 of 10
2022-11-25 19:44:07,525 - harmonypy - INFO - Iteration 6 of 10
INFO:harmonypy:Iteration 6 of 10
2022-11-25 19:44:10,435 - harmonypy - INFO - Converged after 6 iterations
INFO:harmonypy:Converged after 6 iterations

4.3 Repeat plot with batch-corrected PCA¶

New corrected PC are saved in .obs["X_pca_harmony"]

In [18]:
sc.pl.embedding(adata, basis="X_pca_harmony", color=['Donor'])
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

5. Calculate dimensionality reductions¶

5.1 Compute neighborhood graph¶

We compute the neighborhood graph of cells using the harmony-corrected PCA representation of the data. This identifies how similar a cell is to another cell, definying cells that are close from those that are not.

This step is propedeutic for UMAP plotting and for clustering.

Key parameters:

  • n_pcs: number of PC used for compute the kNN graph
  • n_neighbors: number of neighbors. Larger neighbor values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50, with a choice of 10 to 15 being a sensible default.
  • metrics: distance metric used in the calculation
In [19]:
sc.pp.neighbors(adata, n_neighbors=80, n_pcs=20, use_rep="X_pca_harmony", key_added="harmony")
#sc.pp.neighbors(adata, n_neighbors=80, n_pcs=15, use_rep="X_pca_harmony", key_added="harmony", metric='cosine')
computing neighbors
    finished: added to `.uns['harmony']`
    `.obsp['harmony_distances']`, distances for each pair of neighbors
    `.obsp['harmony_connectivities']`, weighted adjacency matrix (0:00:17)

5.2 Calculate the UMAP¶

In [20]:
sc.tl.umap(adata, neighbors_key="harmony", random_state=1)
computing UMAP
    finished: added
    'X_umap', UMAP coordinates (adata.obsm) (0:00:29)

5.3 Calculate the Diffusion Map¶

In [21]:
sc.tl.diffmap(adata, neighbors_key="harmony")
computing Diffusion Maps using n_comps=15(=n_dcs)
computing transitions
    finished (0:00:00)
    eigenvalues of transition matrix
    [1.         0.99714315 0.99270475 0.99232537 0.9917588  0.9895217
     0.9785628  0.9751901  0.97225225 0.9687998  0.9620149  0.9564335
     0.944027   0.93468106 0.9310887 ]
    finished: added
    'X_diffmap', diffmap coordinates (adata.obsm)
    'diffmap_evals', eigenvalues of transition matrix (adata.uns) (0:00:02)

5.4 Calculate the Force-directed graph¶

In [22]:
#Takes quite some time 
sc.tl.draw_graph(adata, neighbors_key="harmony")
drawing single-cell graph using layout 'fa'
    finished: added
    'X_draw_graph_fa', graph_drawing coordinates (adata.obsm) (0:05:06)

6. Visualize UMAP: Diagnostics and Metadata¶

In [23]:
sc.pl.umap(adata, color=['n_genes_by_counts',"total_counts", 'pct_counts_mito', 'pct_counts_ribo'])
In [24]:
sc.settings.set_figure_params(dpi=120)
sc.pl.umap(adata, color=['Donor', 'Cluster'])
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

7. Visualize Diffusion Map: Diagnostics and Metadata¶

In [25]:
sc.settings.set_figure_params(dpi=80)
sc.pl.diffmap(adata, color=['n_genes_by_counts',"total_counts", 'pct_counts_mito', 'pct_counts_ribo'], 
             components=['2,3'])
In [26]:
sc.settings.set_figure_params(dpi=120)
sc.pl.diffmap(adata, color=['Donor', 'Cluster'], components=['2,3'])
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

8. Visualize Force-directed graph: Diagnostics and Metadata¶

In [27]:
sc.settings.set_figure_params(dpi=80)
sc.pl.draw_graph(adata, color=['n_genes_by_counts',"total_counts", 'pct_counts_mito', 'pct_counts_ribo'])
In [28]:
sc.settings.set_figure_params(dpi=120)
sc.pl.draw_graph(adata, color=['Donor', 'Cluster'])
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/usr/local/lib/python3.8/dist-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

9. Visualize UMAP: Cell population markers¶

9.1 Functions for plotting markers in UMAP¶

In [29]:
def selectMarkers(adataObj, mList):
    """  From a list of gene names select only the genes that are present in adata.var
    """
    #Select markers present in adata
    p = adataObj.var_names[adataObj.var_names.isin(mList) == True]
    #Keep the same order as input list
    p = [x for x in mList if x in p]   
    
    #Select missing genes
    ab = set(mList).difference(set(adataObj.var_names))
    
    #Print message
    if len(ab) == len(mList):
        print('\nAll markers are missing')
    else:
        print('\nThe following marker genes are missing: ', ab)
        
    return(p)
In [30]:
def CustomUmap(adata, genes):
    genes = selectMarkers(adata, genes)
    sc.pl.umap(adata, color=genes, size=10, frameon=False,
               vmin='p1',  vmax='p99', layer='lognormcounts')

9.2 Top Triku genes¶

In [31]:
TopTriku = adata.var.sort_values(by=['triku_distance'], ascending=False).head(16).index
CustomUmap(adata, TopTriku)
The following marker genes are missing:  set()

9.3 Dictionary of marker genes¶

In [32]:
marker_dictionary = {
    'Proliferation': ['MKI67', 'CDC20', 'HMGB2', 'CCNB1'], 
    'Apical Progenitors': ['FABP7', 'GLI3', 'PAX6', 'NES', 'SOX1', 'SOX2', 'SOX9', 'VIM'], 
    'Intermediate Progenitors': ['EOMES', 'ELAVL4', 'NHLH1', 'KCNQ3'], 
    'Pan-neuronal': ['DCX', 'STMN2', 'SYT1', 'SYP', 'GAP43', 'MEF2C',  'FOXG1'], 
    'ExcitatoryNeurons': ['EMX1', 'NEUROD1', 'NEUROD2', 'NEUROD6', 'NEUROG2', 'SLC17A7', 'TBR1', 'BCL11A', 'CUX1', 'CUX2', 'SATB2'], 
    'Inhibitory Neurons': ['DLX5', 'DLX6', 'DLX6-AS1', 'GAD1', 'GAD2', 'NKX2.1', 'SST', 'NPY', 'CALB1', 'CALB2', 'VIP', 'PVALB'], 
    'oRG': ['FAM107A', 'HOPX',  'PTPRZ1', 'TNC'], 
    'Astrocytes': ['GFAP', 'SLC1A3',  'S100B', 'AQP4', 'FGFR3', 'ALDH1L1']
    
    }

9.4 Marker gene plotting¶

In [33]:
for Pop in marker_dictionary: 
    print("\n\n {}".format(Pop)) 
    CustomUmap(adata, marker_dictionary[Pop])

 Proliferation

The following marker genes are missing:  set()

 Apical Progenitors

The following marker genes are missing:  set()

 Intermediate Progenitors

The following marker genes are missing:  set()

 Pan-neuronal

The following marker genes are missing:  set()

 ExcitatoryNeurons

The following marker genes are missing:  set()

 Inhibitory Neurons

The following marker genes are missing:  {'CALB1', 'VIP', 'NKX2.1', 'PVALB'}

 oRG

The following marker genes are missing:  set()

 Astrocytes

The following marker genes are missing:  {'AQP4', 'ALDH1L1'}

10. Save¶

10.1 Save AData¶

In [34]:
adata.write(results_file)

10.2 Timestamp finished computations¶

In [35]:
print(datetime.now())
2022-11-25 19:50:37.779670

10.3 Save python and html versions¶

In [36]:
nb_fname = '3_DimRed'
In [37]:
%%bash -s "$nb_fname"
jupyter nbconvert "$1".ipynb --to="python"
jupyter nbconvert "$1".ipynb --to="html"
[NbConvertApp] Converting notebook 3_DimRed.ipynb to python
[NbConvertApp] Writing 8233 bytes to 3_DimRed.py
[NbConvertApp] Converting notebook 3_DimRed.ipynb to html
[NbConvertApp] Writing 2792611 bytes to 3_DimRed.html
In [ ]: